This case study analyzes data collected by a FitBit fitness tracker and makes predictions from it. The data can be fetched from here.
After unzipping the files, it's clear that they are not arranged in any meaningful way. Let's arrange the data according to the timeline each file represents, i.e., daily, hourly, or by the minute. The directory structure will then look similar to this:
.
├── daily
│  ├── dailyActivity_merged.csv
│  ├── dailyCalories_merged.csv
│  ├── dailyIntensities_merged.csv
│  ├── dailySteps_merged.csv
│  └── sleepDay_merged.csv
├── heartrate_seconds_merged.csv
├── hourly
│  ├── hourlyCalories_merged.csv
│  ├── hourlyIntensities_merged.csv
│  └── hourlySteps_merged.csv
├── minutes
│  ├── minuteCaloriesNarrow_merged.csv
│  ├── minuteCaloriesWide_merged.csv
│  ├── minuteIntensitiesNarrow_merged.csv
│  ├── minuteIntensitiesWide_merged.csv
│  ├── minuteMETsNarrow_merged.csv
│  ├── minuteSleep_merged.csv
│  ├── minuteStepsNarrow_merged.csv
│  └── minuteStepsWide_merged.csv
└── weightLogInfo_merged.csv
Before diving into the data, let's first install the required libraries.
install.packages("readr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("hms")
install.packages("plotly")
install.packages("gridExtra")
Now let's load them into memory.
library(readr)
library(dplyr)
library(ggplot2)
library(hms)
library(plotly)
library(gridExtra)
Now that we have our tools and are ready to dive in, we will start by importing all the files from the daily folder into our program.
dailyActivity_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyActivity_merged.csv")
dailyCalories_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyCalories_merged.csv")
dailyIntensities_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyIntensities_merged.csv")
dailySteps_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailySteps_merged.csv")
Let's preview each file to make sure our data is in the right format and to spot any obvious problems.
head(dailyActivity_merged)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(dailyCalories_merged)
## # A tibble: 6 × 3
## Id ActivityDay Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
head(dailyIntensities_merged)
## # A tibble: 6 × 10
## Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinu…
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 728 328 13
## 2 1503960366 4/13/2016 776 217 19
## 3 1503960366 4/14/2016 1218 181 11
## 4 1503960366 4/15/2016 726 209 34
## 5 1503960366 4/16/2016 773 221 10
## 6 1503960366 4/17/2016 539 164 20
## # … with 5 more variables: VeryActiveMinutes <dbl>,
## # SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>
head(dailySteps_merged)
## # A tibble: 6 × 3
## Id ActivityDay StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
After checking each column we can confirm that our data is in the right format and there are no discrepancies.
Next we will check whether any of the files contain duplicate rows. We will do this by defining a function that returns the number of duplicate rows.
count_duplicates <- function(dataframe){
  n <- nrow(dataframe) - nrow(unique(dataframe)) # total rows minus distinct rows
  return(n)
}
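Before running it on the real files, we can sanity-check the helper on a toy data frame (the values below are made up purely for illustration):

```r
count_duplicates <- function(dataframe){
  nrow(dataframe) - nrow(unique(dataframe)) # total rows minus distinct rows
}

# A made-up frame containing one exact duplicate row
toy <- data.frame(id = c(1, 1, 2), steps = c(100, 100, 250))
count_duplicates(toy)
## [1] 1
```

The repeated (1, 100) row is counted once, so the helper reports a single duplicate.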
Now let’s call this function for every file.
count_duplicates(dailyActivity_merged)
## [1] 0
count_duplicates(dailyCalories_merged)
## [1] 0
count_duplicates(dailyIntensities_merged)
## [1] 0
count_duplicates(dailySteps_merged)
## [1] 0
All rows are unique. Next, let's check for NA values in all the files.
dailyActivity_merged %>% is.na() %>% which()
## integer(0)
dailyCalories_merged %>% is.na() %>% which()
## integer(0)
dailyIntensities_merged %>% is.na() %>% which()
## integer(0)
dailySteps_merged %>% is.na() %>% which()
## integer(0)
There are no NA values in any of the files and our cleaning process is done.
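If which() ever returns a non-empty vector, a per-column count makes the culprit easier to locate. A small illustration on a made-up frame:

```r
# colSums(is.na(.)) counts the NA values in each column;
# a clean file returns 0 for every column
toy <- data.frame(a = c(1, 2, 3), b = c(4, NA, NA))
colSums(is.na(toy))
## a b
## 0 2
```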
Before we dive into the Process phase of our analysis, it's important that we get familiar with the data first. Let's check out the column names and try to find relations between the files.
We can use the colnames() function to see the column names of the files.
colnames(dailyActivity_merged)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(dailyCalories_merged)
## [1] "Id" "ActivityDay" "Calories"
colnames(dailyIntensities_merged)
## [1] "Id" "ActivityDay"
## [3] "SedentaryMinutes" "LightlyActiveMinutes"
## [5] "FairlyActiveMinutes" "VeryActiveMinutes"
## [7] "SedentaryActiveDistance" "LightActiveDistance"
## [9] "ModeratelyActiveDistance" "VeryActiveDistance"
colnames(dailySteps_merged)
## [1] "Id" "ActivityDay" "StepTotal"
Upon inspecting these column names we find 3 things:

1. Every file shares the Id column and a per-day date column.
2. The date column is named ActivityDate in dailyActivity_merged but ActivityDay in the other files.
3. dailyActivity_merged already contains every column found in dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged.

The 3rd point of our observation implies that we can get rid of all the files except dailyActivity_merged.
In this phase we will get rid of redundant elements and rename some columns. All these changes will be saved to a new data frame called daily_activity.
options(width = 1500)
daily_activity <- dailyActivity_merged %>%
  transform(active_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes,
            ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y"),
            weekday = weekdays(as.Date(ActivityDate, format = "%m/%d/%Y"))) %>%
  rename(inactive_minutes = SedentaryMinutes, date = ActivityDate,
         total_steps = TotalSteps, total_distance = TotalDistance) %>%
  select(Id, date, weekday, total_steps, total_distance, active_minutes, inactive_minutes, Calories)
colnames(daily_activity) <- tolower(colnames(daily_activity))
Our new data frame looks something like this:
## id date weekday total_steps total_distance active_minutes inactive_minutes calories
## 1 1503960366 2016-04-12 Tuesday 13162 8.50 366 728 1985
## 2 1503960366 2016-04-13 Wednesday 10735 6.97 257 776 1797
## 3 1503960366 2016-04-14 Thursday 10460 6.74 222 1218 1776
## 4 1503960366 2016-04-15 Friday 9762 6.28 272 726 1745
## 5 1503960366 2016-04-16 Saturday 12669 8.16 267 773 1863
## 6 1503960366 2016-04-17 Sunday 9705 6.48 222 539 1728
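As a quick check on the derived columns, active_minutes should equal the sum of the three activity-minute columns, and the date string should parse cleanly. A sketch on a single made-up row:

```r
# Made-up values for one day (not taken from the real file)
toy <- data.frame(VeryActiveMinutes = 25, FairlyActiveMinutes = 13,
                  LightlyActiveMinutes = 328, ActivityDate = "4/12/2016")
toy <- transform(toy,
                 active_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes,
                 date = as.Date(ActivityDate, format = "%m/%d/%Y"))
toy$active_minutes
## [1] 366
toy$date
## [1] "2016-04-12"
```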
Let's create a scatter plot of total_steps against total_distance. We expect them to be directly proportional, as total_distance should increase linearly with total_steps.
As the plot makes clear, our hypothesis was correct. This also means that total_distance is a redundant attribute and we can work with total_steps alone. We can confirm this correlation with cor() like this:
cor(daily_activity$total_steps, daily_activity$total_distance)
## [1] 0.9853688
The 0.98 value signifies that there is a strong relationship between total_steps and total_distance.
Now let's plot total_steps, active_minutes, and inactive_minutes against calories. Our hypothesis is that:

1. total_steps is directly proportional to calories
2. active_minutes is directly proportional to calories
3. inactive_minutes is inversely proportional to calories
To understand the relationship more easily let’s add a regression line to each plot.
Our hypothesis holds: the slopes for total_steps/calories and active_minutes/calories are positive, showing linear growth, while the slope for inactive_minutes/calories is negative. The relationships are not very strong, however, as there is a lot of variance in the data. The correlation values (in red) confirm this, so a linear model fitted to these pairs would not be an apt one.
Now that we have plotted and analyzed the relationships in the raw data, let's compute mean values of the attributes across the week and analyze those.
order_days <- function(data){
  data$weekday <- factor(data$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
  return(data)
}
This function orders the days in the usual Monday-to-Sunday order.
mean_daily <- daily_activity %>%
  group_by(weekday) %>%
  summarize(mean_active = mean(active_minutes), mean_inactive = mean(inactive_minutes),
            mean_steps = mean(total_steps), mean_calories = mean(calories))
mean_daily <- order_days(mean_daily)
head(mean_daily, 7)
## # A tibble: 7 × 5
## weekday mean_active mean_inactive mean_steps mean_calories
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Friday 236. 1000. 7448. 2332.
## 2 Monday 229. 1028. 7781. 2324.
## 3 Saturday 244. 964. 8153. 2355.
## 4 Sunday 208. 990. 6933. 2263
## 5 Thursday 217. 962. 7406. 2200.
## 6 Tuesday 235. 1007. 8125. 2356.
## 7 Wednesday 224. 989. 7559. 2303.
This is the summarized data; let's visualize it now.
By analyzing the graphs we can see that people are:

1. Most active on Saturday and Tuesday
2. Least active around Thursday and Sunday
We will also analyze sleep and heart-rate data, so let's first import the sleep data from sleepDay_merged.
sleepDay_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/sleepDay_merged.csv")
We should first check for duplicate entries and remove any we find.
count_duplicates(sleepDay_merged)
## [1] 3
sleepDay_merged <- unique(sleepDay_merged)
count_duplicates(sleepDay_merged)
## [1] 0
Now that we have eliminated duplicates, let’s check for NA values
sleepDay_merged %>% is.na() %>% which()
## integer(0)
head(sleepDay_merged)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
There are no NA values, so we can move on to cleaning the data frame. As we can see above, the time component of SleepDay is constant, so we will only keep the date. We will convert SleepDay to a date, derive the weekday, and rename a few columns.
sleep_minutes <- sleepDay_merged %>%
  rename(id = Id, date = SleepDay, sleep_time = TotalMinutesAsleep,
         count = TotalSleepRecords, bed_time = TotalTimeInBed) %>%
  transform(date = as.Date(date, "%m/%d/%Y %I:%M:%S %p")) %>%
  transform(weekday = weekdays.Date(date)) %>%
  select(id, date, weekday, count, sleep_time, bed_time)
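The format string hands strptime() the full timestamp pattern, and as.Date() then keeps only the date part (this sketch assumes an English/C locale so that %p matches "AM"/"PM"):

```r
# The constant "12:00:00 AM" component is parsed and then dropped by as.Date()
as.Date("4/12/2016 12:00:00 AM", "%m/%d/%Y %I:%M:%S %p")
## [1] "2016-04-12"
```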
Let's plot some graphs to analyze our data. We will plot the following relations:

1. sleep_time vs bed_time
2. Mean sleep_time per weekday
3. Mean count per weekday
Let’s plot sleep_time VS bed_time
cor(sleep_minutes$sleep_time, sleep_minutes$bed_time)
## [1] 0.9304224
As we can see, the data is highly linear and there is a strong relationship between these attributes; we will fit a regression model to this later on.
Now we will calculate a summary table to find means.
sleep_mean <- sleep_minutes %>% group_by(weekday) %>% summarize(mean_sleep_time = mean(sleep_time), mean_count = mean(count))
sleep_mean <- order_days(sleep_mean)
head(sleep_mean, 7)
## # A tibble: 7 × 3
## weekday mean_sleep_time mean_count
## <fct> <dbl> <dbl>
## 1 Friday 405. 1.07
## 2 Monday 420. 1.11
## 3 Saturday 419. 1.19
## 4 Sunday 453. 1.18
## 5 Thursday 401. 1.03
## 6 Tuesday 405. 1.11
## 7 Wednesday 435. 1.15
Let’s plot mean_sleep_time per weekday.
We can conclude that people sleep:

1. Most on Wednesday and the weekend (Saturday and Sunday)
2. Least on Thursday and Tuesday
We don't really need to plot mean_count; we can just sort the table in descending order of mean_count.
## # A tibble: 7 × 2
## weekday mean_count
## <fct> <dbl>
## 1 Saturday 1.19
## 2 Sunday 1.18
## 3 Wednesday 1.15
## 4 Monday 1.11
## 5 Tuesday 1.11
## 6 Friday 1.07
## 7 Thursday 1.03
We can see that people take the most naps on Saturday, Sunday, and Wednesday, and the fewest on Thursday and Friday. This matches what we concluded from the previous plot.
Now let's perform an inner join of sleep_minutes and daily_activity on their common columns.
sleep_daily <- merge(daily_activity, sleep_minutes, by = c("id", "date", "weekday"))
This is what the resultant data frame looks like now.
head(sleep_daily)
## id date weekday total_steps total_distance active_minutes inactive_minutes calories count sleep_time bed_time
## 1 1503960366 2016-04-12 Tuesday 13162 8.50 366 728 1985 1 327 346
## 2 1503960366 2016-04-13 Wednesday 10735 6.97 257 776 1797 2 384 407
## 3 1503960366 2016-04-15 Friday 9762 6.28 272 726 1745 1 412 442
## 4 1503960366 2016-04-16 Saturday 12669 8.16 267 773 1863 2 340 367
## 5 1503960366 2016-04-17 Sunday 9705 6.48 222 539 1728 1 700 712
## 6 1503960366 2016-04-19 Tuesday 15506 9.88 345 775 2035 1 304 320
Let's find the correlation between the sleep data and the daily activity data.
As you can see, this data is very scattered and there aren't any further insights we can gain from it.
People take:

1. The most naps on Saturday, Sunday, and Wednesday
2. The fewest naps on Thursday, Friday, and Tuesday

People sleep:

1. Most on Wednesday and the weekend (Saturday and Sunday)
2. Least on Thursday and Tuesday
Now let’s analyze data that’s collected hourly and try to find trends.
hourlyCalories_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/hourly/hourlyCalories_merged.csv")
hourlyIntensities_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/hourly/hourlyIntensities_merged.csv")
hourlySteps_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/hourly/hourlySteps_merged.csv")
Let's check for duplicates before diving into the analysis.
count_duplicates(hourlyCalories_merged)
## [1] 0
count_duplicates(hourlyIntensities_merged)
## [1] 0
count_duplicates(hourlySteps_merged)
## [1] 0
As you can see there are no duplicates in these files; now let's check for any NA values.
hourlyCalories_merged %>% is.na() %>% which()
## integer(0)
hourlyIntensities_merged %>% is.na() %>% which()
## integer(0)
hourlySteps_merged %>% is.na() %>% which()
## integer(0)
There are no NA values either; now let's take a look at our data.
head(hourlyCalories_merged)
## # A tibble: 6 × 3
## Id ActivityHour Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
head(hourlyIntensities_merged)
## # A tibble: 6 × 4
## Id ActivityHour TotalIntensity AverageIntensity
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.117
## 4 1503960366 4/12/2016 3:00:00 AM 0 0
## 5 1503960366 4/12/2016 4:00:00 AM 0 0
## 6 1503960366 4/12/2016 5:00:00 AM 0 0
head(hourlySteps_merged)
## # A tibble: 6 × 3
## Id ActivityHour StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
They all have Id and ActivityHour in common. Let's count the number of rows in each table.
nrow(hourlyCalories_merged)
## [1] 22099
nrow(hourlyIntensities_merged)
## [1] 22099
nrow(hourlySteps_merged)
## [1] 22099
Since they all have the same number of rows, we can join them and analyze the data collectively.
hourlyCalories_merged$ActivityHour <- as.POSIXct(hourlyCalories_merged$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
hourlyIntensities_merged$ActivityHour <- as.POSIXct(hourlyIntensities_merged$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
hourlySteps_merged$ActivityHour <- as.POSIXct(hourlySteps_merged$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
one <- merge(hourlyCalories_merged, hourlySteps_merged, by = c("Id", "ActivityHour"))
hourly_merged <- merge(one, hourlyIntensities_merged, by = c("Id", "ActivityHour"))
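merge() joins two data frames on the columns named in by=. With toy frames (made-up values, not the real files) the behavior looks like this:

```r
# Two small frames sharing the key columns Id and ActivityHour
left  <- data.frame(Id = c(1, 1), ActivityHour = c("00", "01"), Calories = c(81, 61))
right <- data.frame(Id = c(1, 1), ActivityHour = c("00", "01"), StepTotal = c(373, 160))

# Inner join on the composite key; each row pairs up with its match
merge(left, right, by = c("Id", "ActivityHour"))
```

The result keeps the key columns once and appends the non-key columns from both sides.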
Let's take a look at hourly_merged now.
head(hourly_merged)
## Id ActivityHour Calories StepTotal TotalIntensity AverageIntensity
## 1 1503960366 2016-04-12 00:00:00 81 373 20 0.333333
## 2 1503960366 2016-04-12 01:00:00 61 160 8 0.133333
## 3 1503960366 2016-04-12 02:00:00 59 151 7 0.116667
## 4 1503960366 2016-04-12 03:00:00 47 0 0 0.000000
## 5 1503960366 2016-04-12 04:00:00 48 0 0 0.000000
## 6 1503960366 2016-04-12 05:00:00 48 0 0 0.000000
We have already analyzed the data by date, so now we are only concerned with the time. Let's write a function to extract the hour from the date-time column.
extract_time <- function(data){
  data$hour <- format(data$ActivityHour, "%H") # ActivityHour is already POSIXct after the conversion above
  return(data)
}
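The hour extraction relies on format() with the "%H" code, which pulls the zero-padded hour out of a POSIXct timestamp. A minimal check (timestamp made up for illustration):

```r
# format(, "%H") extracts the zero-padded hour from a POSIXct timestamp
t <- as.POSIXct("2016-04-12 05:00:00", tz = "UTC")
format(t, "%H")
## [1] "05"
```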
hourly_merged <- extract_time(hourly_merged)
head(hourly_merged)
## Id ActivityHour Calories StepTotal TotalIntensity AverageIntensity hour
## 1 1503960366 2016-04-12 00:00:00 81 373 20 0.333333 00
## 2 1503960366 2016-04-12 01:00:00 61 160 8 0.133333 01
## 3 1503960366 2016-04-12 02:00:00 59 151 7 0.116667 02
## 4 1503960366 2016-04-12 03:00:00 47 0 0 0.000000 03
## 5 1503960366 2016-04-12 04:00:00 48 0 0 0.000000 04
## 6 1503960366 2016-04-12 05:00:00 48 0 0 0.000000 05
Let's rename the columns and drop the average intensity column.
hourly_merged <- hourly_merged %>%
  rename(id = Id, time = hour, steps = StepTotal, intensity = TotalIntensity,
         calories = Calories, date_time = ActivityHour) %>%
  select(id, time, steps, intensity, calories, date_time)
## id time steps intensity calories date_time
## 1 1503960366 00 373 20 81 2016-04-12 00:00:00
## 2 1503960366 01 160 8 61 2016-04-12 01:00:00
## 3 1503960366 02 151 7 59 2016-04-12 02:00:00
## 4 1503960366 03 0 0 47 2016-04-12 03:00:00
## 5 1503960366 04 0 0 48 2016-04-12 04:00:00
## 6 1503960366 05 0 0 48 2016-04-12 05:00:00
Let’s now plot relations between steps, calories and intensity.
As we can see from the scatter plots, the relationships are mostly linearly increasing, and the correlation values on the graphs (in red) confirm a good correlation between the attributes. We shall fit a linear model to them later on.
Let's try to find some trends in this data across the different times of day.
hourly_mean <- hourly_merged %>%
  group_by(time) %>%
  summarize(mean_steps = mean(steps), mean_intensity = mean(intensity), mean_calories = mean(calories)) %>%
  transform(time = as.integer(time))
head(hourly_mean)
## time mean_steps mean_intensity mean_calories
## 1 0 42.188437 2.1295503 71.80514
## 2 1 23.102894 1.4190782 70.16506
## 3 2 17.110397 1.0439443 69.18650
## 4 3 6.426581 0.4437299 67.53805
## 5 4 12.699571 0.6330472 68.26180
## 6 5 43.869099 4.9506438 81.70815
Let’s plot these values to get a better look.
As you can see, the graphs for these attributes are very similar, which makes sense given the good correlation between the variables.
From these plots we can conclude that:

1. People are most active (overall) around 6 PM, and there is a dip at 2 PM where they are least active during the day.
2. They are least active at night, from around 10 PM to 6 AM.
Now let's build a linear regression model to predict total_steps from total_distance. First we fetch the relevant columns from daily_activity.
steps_distance <- daily_activity %>% rename(steps = total_steps, distance = total_distance) %>% select(distance, steps)
head(steps_distance)
## distance steps
## 1 8.50 13162
## 2 6.97 10735
## 3 6.74 10460
## 4 6.28 9762
## 5 8.16 12669
## 6 6.48 9705
This is what the data looks like when it is plotted.
The value in red represents the correlation between the attributes, and at 0.985 it indicates a strong relationship. We can see this in the plot too: the points increase together consistently.
Let's define a function that we will reuse for the other regressions too. It splits the data into training and test sets.
sample_test <- function(dataframe, percent){
  sample_size <- floor((percent/100) * nrow(dataframe))
  set.seed(123) # make the split reproducible
  train_index <- sample(seq_len(nrow(dataframe)), size = sample_size)
  train <- dataframe[train_index, ]
  test <- dataframe[-train_index, ]
  return(list(train, test))
}
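A quick sanity check that the split behaves as intended: on a made-up 100-row frame, an 80% split should give 80 training rows and 20 test rows with no overlap. The function is restated here so the snippet is self-contained:

```r
sample_test <- function(dataframe, percent){
  sample_size <- floor((percent/100) * nrow(dataframe))
  set.seed(123)                                 # reproducible split
  train_index <- sample(seq_len(nrow(dataframe)), size = sample_size)
  list(dataframe[train_index, , drop = FALSE],  # train
       dataframe[-train_index, , drop = FALSE]) # test
}

toy <- data.frame(x = 1:100, y = rnorm(100))    # made-up data
parts <- sample_test(toy, 80)
nrow(parts[[1]])
## [1] 80
nrow(parts[[2]])
## [1] 20
```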
Now let's split the data for training.
train_test_daily <- sample_test(steps_distance, 80)
train_data <- as.data.frame(train_test_daily[1])
test_data <- as.data.frame(train_test_daily[2])
Now we fit a linear regression model to our data.
model <- lm(steps~distance, train_data)
summary(model)
##
## Call:
## lm(formula = steps ~ distance, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6981.0 -529.5 -110.1 477.7 3178.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 615.862 53.313 11.55 <2e-16 ***
## distance 1276.578 7.934 160.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 857.5 on 750 degrees of freedom
## Multiple R-squared: 0.9718, Adjusted R-squared: 0.9718
## F-statistic: 2.589e+04 on 1 and 750 DF, p-value: < 2.2e-16
Our equation for the linear line looks like this:
\(steps = 1276.578*distance + 615.862\)
The goodness of fit is given by R-squared, which is \(97.18\)% for our model.
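The fitted equation can be applied by hand. A sketch using the coefficients from the summary above (the helper name predict_steps is ours, purely for illustration):

```r
# steps = 1276.578 * distance + 615.862, per the fitted model above
predict_steps <- function(distance) 1276.578 * distance + 615.862
predict_steps(5)
## [1] 6998.752
```

So the model predicts roughly 7,000 steps for a distance of 5.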
Let’s use this model to predict values from test_data
predicted_data <- test_data %>% mutate(predicted_steps = predict(model, test_data))
head(predicted_data)
## distance steps predicted_steps
## 1 8.50 13162 11466.776
## 3 6.74 10460 9219.998
## 7 8.59 13019 11581.668
## 9 6.68 10544 9143.404
## 12 9.04 14371 12156.128
## 22 9.66 15103 12947.606
As you can see, the predicted values are quite close to the actual ones; with an R-squared of 0.97 the model should serve us well.
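Closeness can also be quantified: the root-mean-square error gives the typical size of a prediction miss in steps. A sketch using two of the (distance, steps) rows shown above:

```r
# Root-mean-square error: typical size of a prediction miss
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(13162, 10460), c(11466.776, 9219.998))
## [1] 1485.158
```

On these two rows the model is off by about 1,500 steps on average.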
The regression line going through the data represents our model's predictions. The model fits the data well and is usable.
Next let's model the relationship between sleep_time and bed_time. This data is available in the sleep_minutes data frame; let's fetch it.
sleep_bed <- sleep_minutes %>% select(sleep_time, bed_time)
head(sleep_bed)
## sleep_time bed_time
## 1 327 346
## 2 384 407
## 3 412 442
## 4 340 367
## 5 700 712
## 6 304 320
Let’s plot it first.
As is clear from the plot above, the data shows a good correlation, which means we can fit a linear model to it.
Now let's split the data for training.
train_test_daily <- sample_test(sleep_bed, 80)
train_data <- as.data.frame(train_test_daily[1])
test_data <- as.data.frame(train_test_daily[2])
Now we fit a linear regression model to our data.
model <- lm(bed_time~sleep_time, train_data)
summary(model)
##
## Call:
## lm(formula = bed_time ~ sleep_time, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.91 -23.41 -14.56 0.93 332.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.35427 9.71001 4.362 1.73e-05 ***
## sleep_time 0.99427 0.02243 44.332 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.43 on 326 degrees of freedom
## Multiple R-squared: 0.8577, Adjusted R-squared: 0.8573
## F-statistic: 1965 on 1 and 326 DF, p-value: < 2.2e-16
Our equation for the linear line looks like this:
\(bed\_time = 0.99427*sleep\_time + 42.35427\)
The goodness of fit is given by R-squared, which is \(85.77\)% for our model.
Let’s use this model to predict values from test_data.
predicted_data <- test_data %>% mutate(predicted_bed_time = predict(model, test_data))
head(predicted_data)
## sleep_time bed_time predicted_bed_time
## 1 327 346 367.4806
## 3 412 442 451.9936
## 6 304 320 344.6124
## 15 404 425 444.0394
## 17 277 309 317.7671
## 18 273 296 313.7900
As you can see, the predicted values are close enough to the actual values.
Let’s plot the data to get a better look of our model vs our actual data.
The regression line going through the data represents our model's predictions. The model fits the data well and is usable.
Finally, let's model calories as a function of steps and intensity. This data is available in the hourly_merged data frame; let's fetch it.
steps_inten_cal <- hourly_merged %>% select(steps, intensity, calories)
head(steps_inten_cal)
## steps intensity calories
## 1 373 20 81
## 2 160 8 61
## 3 151 7 59
## 4 0 0 47
## 5 0 0 48
## 6 0 0 48
With this data we aim to predict the calories burned in an hour given the steps taken and the intensity.
This is what our data looks like as a 3D scatter plot. We will split the data for model training first and then fit the model.
train_test_daily <- sample_test(steps_inten_cal, 80)
train_data <- as.data.frame(train_test_daily[1])
test_data <- as.data.frame(train_test_daily[2])
model <- lm(calories~steps+intensity, train_data)
summary(model)
##
## Call:
## lm(formula = calories ~ steps + intensity, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205.87 -15.07 -1.35 17.18 398.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.682e+01 2.309e-01 289.371 < 2e-16 ***
## steps 4.257e-03 6.639e-04 6.412 1.47e-10 ***
## intensity 2.433e+00 2.157e-02 112.797 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.58 on 17676 degrees of freedom
## Multiple R-squared: 0.8021, Adjusted R-squared: 0.8021
## F-statistic: 3.583e+04 on 2 and 17676 DF, p-value: < 2.2e-16
This is the summary of our model. It has an R-squared of 80.21%, which is not great but workable for our purposes. Let's predict using our test data.
predicted_data <- test_data %>% mutate(predicted_calories = predict(model, test_data))
head(predicted_data)
## steps intensity calories predicted_calories
## 3 151 7 59 84.49774
## 8 0 0 47 66.82258
## 12 360 12 76 97.55342
## 24 338 21 81 119.35848
## 34 1679 27 136 139.66636
## 35 295 15 77 104.57628
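As with the simpler models, the fitted surface can be written out and applied by hand. A sketch with the (rounded) coefficients from the summary above; the helper name predict_calories is ours, for illustration only:

```r
# calories ~ 66.82 + 0.004257*steps + 2.433*intensity, rounded from the model above
predict_calories <- function(steps, intensity) 66.82 + 0.004257 * steps + 2.433 * intensity
predict_calories(151, 7)
## [1] 84.49381
```

This matches the model's prediction of about 84.5 calories for the 151-step, intensity-7 row shown above (up to rounding of the coefficients).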
As you can see, the predictions are not perfect but close enough. We can plot the predictions against the actual results for a better visual.
As you can see there is a lot of variance in the data, and this is about as good as a linear regression model will get here.
In this case study we analyzed daily and hourly data collected by a FitBit fitness tracker, aiming to find trends in the data. The data included attributes such as calories burned, steps taken, distance traveled, sleep time, and active and inactive minutes, across different days and hours of the day. After analyzing, we came to these conclusions:
People are:

1. Most active on Saturday and Tuesday
2. Least active around Thursday and Sunday

People take:

1. The most naps on Saturday, Sunday, and Wednesday
2. The fewest naps on Thursday, Friday, and Tuesday

People sleep:

1. Most on Wednesday and the weekend (Saturday and Sunday)
2. Least on Thursday and Tuesday
Finally, here are the overall mean values of the key attributes:

| Attributes | Daily Mean | Hourly Mean |
|---|---|---|
| Steps | 7629.360000 | 321.43040 |
| Distance | 5.489702 | NA |
| Calories | 2304.595000 | 97.50392 |
| Inactive Minutes | 991.660700 | NA |
| Active Minutes | 227.634200 | NA |
| Sleep Minutes | 419.607700 | NA |
| Intensity | NA | 12.08221 |